What is reinforcement learning?
In previous lessons, we considered how an algorithm can train an artificial neural network when the programmer has access to a large set of training data in the form of known inputs paired with target outputs. A problem arises when such training data is unavailable, or when the application is so complex that building a complete training data set is impractical. One way to employ AI under these conditions is reinforcement learning (RL).
Reinforcement learning is a goal-oriented mode of training an AI: the AI learns by modifying the way it functions in order to attain a stated objective. The basic training mechanism is that the AI operates an agent, which acts out a strategy and receives either a 'reward' or a 'penalty' (negative reward) depending on how successful that strategy was at attaining the objective. The AI trials agents with various strategies and converges towards the one that delivers the most reward. The reinforcement learning algorithm seeks to maximise the cumulative reward, and in doing so it refines its function towards an optimal goal-attaining strategy.
Note that the AI designer does not tell the agent how it should function but simply creates the reward criteria; it is then up to the algorithm itself to determine how the agent should perform a given task. The final strategy is often not the conventional strategy that a human expert would employ, and in some famous cases algorithms have devised unintuitive strategies that outperform those previously accepted by human experts.
Common terms in reinforcement learning
Before continuing, it is important to define some of the common terms used when describing reinforcement learning:
Agent
The agent is the entity that performs the action within the environment, changing from one state to a different state.
Action (a)
The action is a move that an agent can make, which changes the relation of the agent within the environment. Generally, agents select from a list of possible actions, constrained by the environment.
Environment (e)
The environment is the world within which the agent exists and operates. It determines which states can occur and which actions are available to the agent.
State (s)
The state refers to the current configuration of the environment and the relation of the agent within it. An example might be the configuration of a chessboard.
Reward (r)
The reward is a value returned to the algorithm indicating the positive or negative impact of its action. It is a measure of performance in attaining a goal.
Value (v)
The value is the expected long-term cumulative reward from a given state, as distinct from the short-term reward r received for a single action.
Policy (π)
The policy is the strategy that the agent employs to determine its next action, based on the current state.
Model of the Environment*
A model is an internal representation of the environment that allows the agent to predict how the environment will respond to its actions, without having to act within the environment itself.
* The various methods of learning shall be described in more detail later in this lesson.
The reinforcement learning mechanism and algorithms
The mechanisms and algorithms used in reinforcement learning are often very complex, but in its simplest form reinforcement learning is quite straightforward, as summarised in the following illustration.
Imagine that you want to teach your cat to play. Obviously, you cannot simply communicate the goal to the cat directly, so you place her on the floor and try to play with her. If the cat plays as desired, you give her a fish; if not, she gets no fish. Over time, the cat may learn how to play: she has learned to perform an action because she values the reward she receives.
Figure 1: The mechanism of reinforcement learning with a cat
In this scenario:
- The cat is an agent.
- She has been placed on the floor, the environment.
- The cat starts in one state (sitting).
- The cat takes an action to reach a new state (playing).
- The cat is given a reward for this action and thus recognises it as good.
- During training, the cat might try different actions, e.g. scratching, for which she receives no reward.
- Over time, she learns that playing is the best strategy to optimise rewards, and this governs her future behaviour.
If we consider the agent and the environment as transfer functions:
- the environment transforms the current state and current action (inputs) into the next state and a reward (outputs).
- the agent transforms the state and its associated reward (inputs) into the next action (output).
The above conception can be summarised in the following flow chart:
Figure 2: Reinforcement learning flow chart
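To make the loop concrete, the two transfer functions can be sketched in Python. This is a minimal sketch: the states, actions, rewards and goal state below are invented for illustration and are not taken from any particular system.

```python
import random

def environment_step(state, action):
    """The environment: transforms (current state, action) into (next state, reward)."""
    next_state = state + action             # a toy transition rule
    reward = 1 if next_state == 5 else -1   # hypothetical goal: reach state 5
    return next_state, reward

def agent_policy(state, reward):
    """The agent: transforms (state, reward) into the next action."""
    return random.choice([-1, 1])           # an untrained agent acts randomly

state, reward = 0, 0
for step in range(10):                      # one short episode of sequential decisions
    action = agent_policy(state, reward)
    state, reward = environment_step(state, action)
    print(f"step={step} action={action:+d} state={state} reward={reward}")
```

A trained agent would replace the random choice with a learned policy, but the loop itself is unchanged.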
Reinforcement learning algorithms
There are three common algorithms used when creating a reinforcement learning mechanism:
Value-based algorithms
In a value-based learning method, the algorithm seeks to maximise a value function V(s): it learns the long-term cumulative reward to be expected from each state and selects actions that lead to the most valuable states.
Policy-based algorithms
In a policy-based learning method, the algorithm seeks to establish a policy π: a strategy that, when employed, maximises the cumulative reward from any given state. Policy-based algorithms may be:
- Deterministic: from any given state, the same action is taken each time, or
- Stochastic: from any given state, a range of actions is available, normally selected based on a probability function (see the sketch after this list).
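The difference between the two can be sketched as follows, reusing the cat scenario from earlier; the actions and probabilities are illustrative assumptions.

```python
import random

# Deterministic policy: a fixed lookup, always the same action for a given state.
deterministic_policy = {"sitting": "play"}

# Stochastic policy: a probability distribution over actions for each state.
stochastic_policy = {"sitting": {"play": 0.7, "scratch": 0.3}}

def act_deterministic(state):
    return deterministic_policy[state]

def act_stochastic(state):
    actions = list(stochastic_policy[state])
    weights = list(stochastic_policy[state].values())
    return random.choices(actions, weights=weights)[0]

print(act_deterministic("sitting"))  # always 'play'
print(act_stochastic("sitting"))     # 'play' roughly 70% of the time
```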
Model-based algorithms
In a model-based learning method, the algorithm builds an internal model of the environment, which the agent can use to predict the outcomes of its actions and plan accordingly.
Note that each of the above algorithms shares some common characteristics:
- There is no supervisor: the programmer defines the performance and reward criteria, whilst the learning algorithm measures success and develops a strategy.
- Decision making is sequential, with each decision taking a time step to a new state.
- Feedback is delayed: the agent is informed by an assessment of how well previous actions performed, so feedback cannot be instantaneous, i.e. based only on the current action.
Positive or negative reinforcement
Reinforcement learning may employ either positive or negative learning mechanisms:
- A positive learning mechanism rewards those of the agent's actions that impact positively on the environment. This type of learning seeks to optimise performance, but can result in agents becoming trapped in a dominant and inflexible policy.
- A negative learning mechanism penalises those of the agent's actions that impact negatively on the environment. This type of learning seeks to avoid poor performance, but may only achieve a base level of acceptable behaviour in an environment, without ever optimising performance.
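As a simple sketch, the two mechanisms differ only in the sign structure of the reward function; the action names here are hypothetical.

```python
# Positive mechanism: reward desirable actions, give nothing otherwise.
def positive_reward(action):
    return 10 if action == "reach_goal" else 0

# Negative mechanism: penalise undesirable actions, give nothing otherwise.
def negative_reward(action):
    return -10 if action == "hit_wall" else 0

# In practice, the two are often combined into a single reward function.
def combined_reward(action):
    return positive_reward(action) + negative_reward(action)
```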
Learning models
There are many learning models employed in reinforcement learning, but the two best known are the Markov Decision Process and Q-Learning:
The Markov Decision Process
The Markov Decision Process (MDP) is a stochastic, policy-based model that provides a mathematical framework for deciding which action to take from the set of possible actions available in each state of the environment. The decision is made based on the predicted reward received from transitioning between states. The goal of the algorithm is to find an optimal policy (π*) that governs which action the agent should take in any given state to optimise the cumulative reward.
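Formally, an MDP is usually written as a tuple (S, A, P, R, γ) of states, actions, transition probabilities, rewards and a discount factor. The sketch below uses made-up states, probabilities and rewards to show how the expected reward of an action is computed within this framework.

```python
# Transition probabilities: P[state][action] = {next_state: probability}.
P = {"start": {"go": {"goal": 0.8, "trap": 0.2}}}

# Rewards for each (state, action, next_state) transition.
R = {("start", "go", "goal"): 10, ("start", "go", "trap"): -5}

def expected_reward(state, action):
    """Expected one-step reward of taking `action` in `state`."""
    return sum(p * R[(state, action, s2)]
               for s2, p in P[state][action].items())

print(expected_reward("start", "go"))  # 0.8 * 10 + 0.2 * (-5) = 7.0
```

An optimal policy π* would be built by comparing such expected (and discounted future) rewards across all actions available in each state.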
Q-Learning
Q-Learning is a value-based algorithm that assigns a value to each transition between states (note that the environment governs which state transitions can occur). The programmer tells the algorithm what the goal is by assigning a high value to transitions that reach the goal and a low or negative value to all other transitions. A Q-Learning algorithm builds a matrix of the values of all possible successive transitions and selects the route with the highest cumulative value to govern future actions. This process is illustrated in the following simple maze example.
Say a robot starts in location A of the maze and the goal is to reach location F:
Figure 3: Q-Learning in a maze
Without assessing every possible permutation, a simple Q-Learning algorithm should be able to compare the following possible routes through the maze:
| State (S) | S | S+1 | S+2 | S+3 | S+4 | S+5 | Total |
|---|---|---|---|---|---|---|---|
| Route 1 (Value) | A (0) | D (-1) | E (-1) | B (-1) | C (-1) | F (10) | 6 |
| Route 2 (Value) | A (0) | D (-1) | E (-1) | F (10) | | | 8 |
Optimising for the maximum cumulative value, the Q-Learning algorithm should select Route 2 as being better than Route 1. Therefore, when in the state of location E, the algorithm should govern actions to move the agent to the state of location F instead of location B.
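The same conclusion can be reached with a tabular Q-Learning sketch. Since Figure 3 is not reproduced here, the maze adjacency below is inferred from the two routes in the table, and the reward scheme follows the table (+10 for reaching F, -1 per move).

```python
import random

# Maze adjacency inferred from the two routes above (an assumption).
maze = {"A": ["D"], "D": ["A", "E"], "E": ["D", "B", "F"],
        "B": ["E", "C"], "C": ["B", "F"], "F": []}

def reward(next_state):
    return 10 if next_state == "F" else -1   # +10 at the goal, -1 per move

Q = {(s, s2): 0.0 for s in maze for s2 in maze[s]}  # value of each transition
alpha, gamma = 0.5, 0.9                             # learning rate, discount factor

for episode in range(500):
    s = "A"
    while s != "F":
        s2 = random.choice(maze[s])                 # explore randomly
        best_next = max((Q[(s2, s3)] for s3 in maze[s2]), default=0.0)
        # Q-Learning update: nudge Q towards reward plus discounted best future value.
        Q[(s, s2)] += alpha * (reward(s2) + gamma * best_next - Q[(s, s2)])
        s = s2

# From state E, the learned values favour moving directly to F rather than to B.
print(round(Q[("E", "F")], 1), ">", round(Q[("E", "B")], 1))
```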
Applications of reinforcement learning
As we have seen, the learning mechanism employed in reinforcement learning is significantly different from the supervised forms of learning we have considered previously. As such, artificial intelligence that utilises reinforcement learning normally finds different applications. Some common applications are listed below:
- Autonomous navigation through an environment
- Robotics
- Autonomous gaming systems
- Traffic management systems
- Task allocation (resource management) in networked computer systems
- Personalised recommendation engines
In general, reinforcement learning may be considered as an alternative solution in those applications for which the range of states is either so large or so unpredictable that a programmer is unable to build an adequate training set.
However, where adequate training data can be created, or where modelling the environment proves too complex, reinforcement learning is likely to be an ineffective solution.
Summary
In this section, we have introduced the method of reinforcement learning for artificial intelligence. In summary:
- RL is a form of learning without a supervisor, in which the AI interacts with an environment instead of a set of training data.
- RL works by deciding upon which action to take to maximise a reward function.
- Three common methods of RL are value-based, policy-based and model-based learning.
- Rewards may be positive, negative or a combination of both.
- Two common models are: the Markov Decision Process and Q-Learning.